Abstract

Most existing information retrieval systems are based on the bag-of-words
model and are not equipped with common world knowledge. Work has been
done towards improving the efficiency of such systems by using intelligent
algorithms to generate search queries; however, not much research has been
done in the direction of incorporating human- and society-level knowledge in
the queries. This paper is one of the first attempts where such information
is incorporated into the search queries using Wikipedia semantics. The
paper presents an essential shift from conventional token-based queries to
concept-based queries, leading to an enhanced efficiency of information
retrieval systems. To efficiently handle the automated query learning problem,
we propose the Wikipedia-based Evolutionary Semantics (Wiki-ES) framework,
where concept-based queries are learnt using a co-evolving evolutionary
procedure. Learning concept-based queries using an intelligent evolutionary
procedure yields significant improvement in performance, which is shown through
an extensive study using Reuters newswire documents. The proposed framework
is compared with other information retrieval systems. The concept-based
approach has also been implemented on other information retrieval systems to
justify the effectiveness of a transition from token-based queries to
concept-based queries.

1. Introduction



A central challenge in building expert systems for information retrieval 
(IR) is to provide them with common world knowledge. As succinctly put by
Hendler and Feigenbaum [16], in order to build any system with "significant
levels of computational intelligence, we need significant bodies of knowledge 
in knowledge bases". That is, if a system is expected to understand the 
general semantics in text, closer to the way human brains do, then it should 
have access to the extensive background knowledge that people use while in- 
terpreting concepts (units of knowledge) and their dependencies. Of course, 
statistical methods and natural language processing can be used to extract 
semantics from text or data, but the ability of text collections to convey human
and society-level semantics is quite limited [48]. Currently, there is an
ongoing quest to find new ways of integrating semantic knowledge into doc- 
ument modelling without time-consuming engineering. One of the emerging 
trends is to use socially developed resources of semantic information. 

In this paper, we consider the use of Wikipedia as a source of common 
world knowledge for an automated query learning system. The purpose is to 
assist users to express their information needs as queries which are written 
in terms of Wikipedia's concepts instead of word tokens. The proposed sys-
tem extends the Inductive Query By Example (IQBE) paradigm of Smith
and Smith [42] and Chen et al. [5] by incorporating human-level semantics
using Wikipedia. The underlying principle of IQBE is quite simple: given
that a user provides a small collection of relevant (and irrelevant) example
documents, the task is to learn a query based on those documents. The
learnt query is then used to filter relevant documents from a news-stream or
document database according to the topic definition implied by the sample 
collection. The approach proposed in this paper uses concept-relatedness in- 
formation contained in Wikipedia's link-structure to learn semantic queries 
using an intelligent co-evolutionary procedure. This transition from an or-
dinary boolean query [39] to a semantified query is necessary for integrating
human and society-level semantic information into the information retrieval 
(IR) system. The integration of concept-based knowledge into the IR system 
enables it to detect the relevance of a document based on concepts and not 
just words. It also allows the system to identify those documents as relevant 
which contain concepts closely related to the query concepts. The paper con-
tributes towards construction of an IR framework where Wikipedia-concept
based queries are learnt using a co-evolving genetic programming (GP) algo- 
rithm. The proposed framework is called Wiki-ES (Wikipedia-based Evolu- 
tionary Semantics). 

Traditional automated query learning systems usually represent both
queries and documents using a bag-of-words approach. Moreover, recent
studies on the IQBE paradigm have almost exclusively focused on finding the best
evolutionary algorithms and fitness functions for learning boolean queries;
see e.g. Cordon et al., Garcia and Herrera [14], and Lopez-Herrera et
al. [22], [21]. These developments have been well-motivated by the fact that
the classical Boolean IR model is still broadly used, and there is considerable 
demand for query learning systems which can be run on top of any Boolean IR 
platform. However, restricting the query and document models to word-level 
information eliminates the possibility of leveraging human-level semantics on 
how the different topics and concepts are related. It should be noted that a 
query is composed of a number of concepts, and it represents the topic the 
user wants to search. To illustrate the difference between word based search 
and concept based search, consider a situation where a user is searching 
for information on a particular topic, for which he crafts a simple query 
"economy AND espionage". Then, suppose that a newly arrived document 
has concepts "Trade secret" and "spying". If we now ask a human reader 
to judge whether the document is about economic espionage, he would most 
likely find it relevant due to the close relationships between the concepts. 
However, if only word-level information is used, the boolean query will ignore 
the document as the original query words never appear. 
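
The mismatch can be made concrete with a minimal sketch. It assumes a naive whitespace tokenizer and an invented one-line document; the function name `matches_word_query` and the sample text are illustrative only and are not part of the system described in this paper.

```python
def matches_word_query(required_words, document_text):
    """Return True only if every query word occurs as a token in the text.

    This mimics a purely word-level boolean AND query.
    """
    tokens = set(document_text.lower().split())
    return all(word.lower() in tokens for word in required_words)


# Hypothetical document mentioning the concepts "Trade secret" and "spying".
doc = "Engineer accused of spying after a trade secret was leaked to a rival firm"

# The query "economy AND espionage" finds none of its words in the text,
# although a human reader would judge the document to be on topic.
word_level_hit = matches_word_query(["economy", "espionage"], doc)
```

A concept-based matcher would instead ask how closely "Trade secret" and "spying" are related to the query concepts, which is the question addressed in the remainder of the paper.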

In this paper, we focus on the benefits of using concepts instead of bag- 
of-words in query learning. As a test-bed for the Wiki-ES system, we consider
the TREC-11 dataset with the Reuters RCV1 corpus, which provides a realistic ex-
ample of a multi-domain news-stream. The experiments suggest that the
concept-based approach is well-fitted to be used in conjunction with evo- 
lutionary algorithms. First of all, we observe that replacing tokens with 
Wikipedia's concepts yields significant improvement in filtering results as 
measured by precision and recall. The achieved improvements are consid- 
erable against other benchmarks as well, such as Support Vector Machines 
(SVM) and the decision-tree algorithm C4.5, which are well-known for their 
robust performance. Furthermore, when comparing the complexity of the 
query-trees produced by Wiki-ES against those of word-based IQBE-GP, we 
find that the use of concepts leads to simpler queries.

The structure of this paper is as follows. Section 2 summarizes the main
contributions of the paper. Section 3 gives a review of the IQBE model for auto-
mated query learning, and how Wikipedia can be used as a source of semantic
information. Section 4 presents our framework, Wikipedia-based Evolution-
ary Semantics (Wiki-ES). The co-evolutionary GP algorithm and the
experimental results are presented in the sections that follow.

2. Contributions 

The key contributions of the paper are summarized in the following points: 

2.1. Use of Wikipedia semantics in query learning 

When a set of documents concerning a particular topic is to be retrieved 
from a database, it is common for a user to generate a query composed 
of token words. This query is used to decide the relevance of documents 
in a database by performing a search for the tokens in those documents. 
However, from the user's point of view,
the user is not just interested in the documents containing the exact
matching tokens; rather, she is seeking all documents which contain
the concept represented by the token. This provides a motivation to work
towards generating queries composed of concepts rather than tokens. Queries
composed of concepts draw on a wide range of human- and society-level knowledge,
providing a better representation of the topic being searched. In this paper,
we use Wikipedia semantics to convey the concept behind a token. To the
knowledge of the authors, there is no previous study which utilizes
Wikipedia semantics to construct a concept-based query. The efficacy of this
transition from tokens to concepts, towards retrieval of documents, has been 
evaluated in the paper and its significance has been established. 

2.2. Development of a co-evolving GP

Generating an accurate query for a search is often an iterative and tedious 
task to perform. However, if there is a set of documents available at hand, 
with each document marked relevant or irrelevant, the task of query gener- 
ation can be entirely avoided by directing the documents to an intelligent 
algorithm. Based on the relevance or irrelevance of the training documents, 
a concept based query can be learnt by the algorithm, saving the user from 
a monotonous task. The paper contributes towards the development of a co-
evolving evolutionary algorithm specialized to generate concept-based queries
for document retrieval. The algorithm takes a set of training documents as
input. Each document in the training set is marked as relevant or irrelevant
by the user, based on which the algorithm produces concept-based queries.
The outcome of the algorithm is not a single query but a set of queries
which are put together using a voting function. The use of multiple queries
and a voting function helps to avoid over-fitting to the training set,
which may happen if only a single query is generated. The multiple queries pro-
duced by the algorithm occupy different high-fitness niches in the objective
space and contribute towards the final decision on whether a document is relevant
or irrelevant. Though genetic programming has been widely used for query
construction, the implementations have usually involved producing a single
best-fit token-based query.
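
To illustrate how a set of queries can be combined, the following is a minimal sketch only. The text above does not specify the voting function, so a simple majority vote is assumed here; the `Query` type, the `majority_vote` function, the threshold and the three toy queries are all hypothetical.

```python
from typing import Callable, List, Set

# Assumption: a learnt query is modelled as a function from a document's
# set of detected concepts to a boolean match decision.
Query = Callable[[Set[str]], bool]


def majority_vote(queries: List[Query], doc_concepts: Set[str],
                  threshold: float = 0.5) -> bool:
    """Mark a document relevant if more than `threshold` of the queries match."""
    if not queries:
        raise ValueError("at least one query is required")
    votes = sum(1 for query in queries if query(doc_concepts))
    return votes / len(queries) > threshold


# Three hypothetical queries for the topic "economic espionage".
queries = [
    lambda c: "Trade secret" in c,
    lambda c: "Espionage" in c and "Military" not in c,
    lambda c: "Industrial espionage" in c,
]

relevant_doc = {"Trade secret", "Espionage"}    # two of three queries match
irrelevant_doc = {"Espionage", "Military"}      # no query matches
```

Because the decision depends on several queries at once, a single query that happens to fit the training documents too closely has limited influence on the final outcome.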

2.3. Comparison of existing methodologies for Information Retrieval

The paper performs a comprehensive comparison study of the existing 
methodologies. In addition to the proposed framework, other frameworks
have also been evaluated on a hundred different topics and the results have
been presented. The concept-based query construction approach has been
implemented on the existing frameworks as well and significant improvement 
in results has been obtained for all the methodologies. Detailed evaluation 
results have been presented for the proposed framework against its closest 
competitor. In the extensive simulations performed, the proposed framework 
is found to outperform the other commonly used methodologies. 



3. Prelude: Wikipedia semantics and IQBE 

To provide an idea on the wealth of Wikipedia's semantic information 
and how that information can be utilized in query learning, we briefly dis-
cuss the recent innovations which leverage Wikipedia's link-structure to pro-
duce low-cost measures on the relationship between concepts and topics. In
this section, we also summarize the recent developments in automated query
learning. In particular, we consider the work inspired by evolution-based ge-
netic algorithms, and the IQBE paradigm of Smith and Smith [42] and Chen
et al. [5].

3.1. Wikipedia as a semantic knowledge-resource 

Research on ontology-based knowledge models has been largely motivated 
by their ability to provide unique definitions for concepts, their relationships
and properties, which together create a unified description of a given do-
main. Having access to such structured information in machine-readable
form has provided standardized ways for sharing common knowledge and,
thus, enabled its efficient reuse in applications. Despite these advantages,
the use of ontologies has been limited because of the large engineering costs
that are unavoidable in manually built knowledge-resources. Furthermore,
there is the difficulty of keeping the resources updated, in particular, when
multiple domains are considered. As is commonly known [43], [25], even
the most extensive ontologies, such as the Cyc ontology, have limited and
patchy coverage. Therefore, the urgent need to find less expensive ways 
to describe concepts and their dependencies is well recognized. This has 
motivated research towards the use of socially or automatically constructed 
knowledge-resources. 

When speaking of readily accessible multi-domain knowledge resources,
the one that instantly comes to mind is Wikipedia. Thanks to the
activity of numerous volunteers, Wikipedia has rapidly matured into one of 
the largest repositories of manually maintained knowledge. Today, there are 
already over 3.3 million articles in English Wikipedia, and more arrive on a 
daily basis. The popularity of Wikipedia has also stimulated increasing re- 
search to investigate how the mountains of semantic information in Wikipedia 
can be harnessed for good uses; see Medelyan et al. [25] for a comprehensive
review. As pioneering research in this field, we acknowledge the work done by
Ponzetto and Strube [36], [38], [37], [44], Gabrilovich and Markovitch,
Milne et al. [29], [32], [30], [31], Medelyan et al. [26], [24],
Nastase et al. [33], [34],
and Mihalcea and Csomai [28], who have examined different ways of using
Wikipedia to compute semantic relatedness between concepts and perform 
automated cross-referencing of documents. 



3.1.1. Wikipedia-concept

In spite of the fact that Wikipedia does not really fulfill the criteria of
being an ontology, a closer look at its structure reveals many similarities [17].
By interpreting Wikipedia's articles as concepts, and by regarding the over-
all link-structure - including redirects, hyper-links, and category links - as
relations, it is warranted to argue that Wikipedia is the largest semantic net-
work available today. As nicely captured by Medelyan et al. [25], Wikipedia
provides a solid middle-ground between ontologies and classical thesauri "by
offering a rare mix of scale and structure". Indeed, the recent developments
suggest a number of ways in which Wikipedia can be used for extracting
ontological knowledge; for example, see the Yago-ontology of Suchanek et
al. [45] and WikiNet by Nastase et al. [34].

Figure 1: "Goldman Sachs" as a Wikipedia-concept



The primary feature that makes Wikipedia considerably richer in se- 
mantic knowledge than a conventional thesaurus is its dense internal link- 
structure. To illustrate the notion of Wikipedia-concept a bit more closely, let
us consider, for instance, Wikipedia's article on "Goldman Sachs" (Figure 1).
Each Wikipedia-concept (article) belongs to one or more categories,
which provide information about broader topics, hyponyms and holonyms. In 
this case, we find that Goldman Sachs belongs to categories such as "Invest- 
ment Banks" and "Banks of the United States". Moreover, if the article's 
topic is sufficiently broad, then there also exists an equivalent category with 
the same title as the article. In addition to category-relationships, the ar- 
ticles have lots of hyper-links that represent semantic relationships between 
concepts. On average, each article refers to about 25 other articles. For 
instance, "Goldman Sachs" has links to many other banks (e.g. "Morgan
Stanley") and financial concepts (e.g. "Subprime mortgage crisis"). These
linkages can be exploited in various ways to mine knowledge on concepts and
their relationships. Finally, to account for synonyms and alternative spellings
of the article's name, each article also has a number of redirects that connect
to the article. The redirects are complemented by anchors, which represent
the words used within hyper-links that refer to the given article; and when
several articles could be given the same name (e.g. Bank), there is a
disambiguation page that lists the alternative senses corresponding to that
name.

Therefore, considering the wealth of semantic information conveyed by 
Wikipedia, we find it natural to treat the Wikipedia-articles as equivalents 
for ontological concepts when modelling documents and queries. To formalise 
these ideas, we employ the following notation while referring to Wikipedia- 
concepts: 

Definition 3.1.1 (Wikipedia-concept). Let $W$ denote the collection of
Wikipedia-articles available for language S. Then a Wikipedia-concept is
defined as an article $w \in W$, which is a uniquely identified representative of
a certain concept.
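
As a rough illustration of the information attached to a single Wikipedia-concept, the sketch below models an article as a small record. The class name, field names and helper function are assumptions made for this example; the categories and links for "Goldman Sachs" are taken from the description above, whereas the redirect and the second "Bank" entry are invented to show why one surface string may map to several concepts.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class WikipediaConcept:
    """One article w in W, identified uniquely by its title."""
    title: str
    categories: FrozenSet[str] = frozenset()
    out_links: FrozenSet[str] = frozenset()
    redirects: FrozenSet[str] = frozenset()  # alternative names and spellings
    anchors: FrozenSet[str] = frozenset()    # link texts pointing to the article

    def surface_forms(self) -> FrozenSet[str]:
        """All lower-cased strings that may denote this concept in text."""
        names = {self.title} | self.redirects | self.anchors
        return frozenset(name.lower() for name in names)


def candidate_concepts(surface: str,
                       concepts: List[WikipediaConcept]) -> List[WikipediaConcept]:
    """Return every concept whose title, redirects or anchors match the string."""
    key = surface.lower()
    return [c for c in concepts if key in c.surface_forms()]


goldman = WikipediaConcept(
    title="Goldman Sachs",
    categories=frozenset({"Investment Banks", "Banks of the United States"}),
    out_links=frozenset({"Morgan Stanley", "Subprime mortgage crisis"}),
    redirects=frozenset({"Goldman"}),  # hypothetical redirect
)
bank = WikipediaConcept(title="Bank")
river_bank = WikipediaConcept(title="Bank (geography)",
                              anchors=frozenset({"bank"}))  # hypothetical entry
concepts = [goldman, bank, river_bank]
```

Looking up "Bank" in this toy collection returns two candidates, which is the kind of ambiguity that a disambiguation page lists and that concept recognition has to resolve.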

Once we have the definition, there are at least two questions that follow. The
first one is concerned with concept-recognition. Clearly, it is not uncommon 
to find that several concepts may share the same textual representation. 
Thus, being able to resolve whether a certain concept is present in a document 
or not is a non-trivial problem. In the literature, this is commonly referred
to as the wikification task [28] or automatic topic-linking problem [30], [26].
This will be discussed more closely when outlining the content model used
by Wiki-ES.

The second question, discussed in the following Section 3.1.2, concerns the
way semantic relatedness between any (concept, concept)-pair and (concept, 
document)-pair is measured. This needs to be resolved before we discuss the 
idea behind Wikipedia-based query rules and the way they are learned from 
example documents provided by a user. In particular, we need the notion 
of semantic relatedness while evaluating whether a document matches the 
given query or not. 

3.1.2. Measuring semantic relatedness 

Although approaches to measuring conceptual relatedness based on cor-
pora or WordNet have been around for quite some time (McHale and Finkel-
stein et al.), the use of Wikipedia as a source of background knowledge
is a relatively new idea. The first step in this direction was taken by Strube
and Ponzetto, who proposed their WikiRelate-technique that modified
existing measures to better work with Wikipedia. This was soon followed by
the paper of Gabrilovich and Markovitch [12], who suggested explicit seman-
tic analysis (ESA) to define a highly accurate similarity measure using the
full text of all Wikipedia articles.

The most recent proposal is, however, the Wikipedia Link-based Measure
proposed by Milne et al. [29], [30], where only the internal link structure
of Wikipedia is used to define relatedness. The approach is known to be
computationally very cheap and has still achieved relatively high correlation
with human judgements, which is why we have adopted it as a basis for the similarity
measures used in this paper. The relatedness measure essentially corresponds
to the Normalized Google Distance inspired by Cilibrasi and Vitanyi.



Definition 3.1.2 (Link-relatedness (Milne et al. [29], [30])). Let $w_1$ and
$w_2$ be an arbitrary pair of Wikipedia-concepts, and let $W_1, W_2 \subset W$ denote
the sets of all articles that link to $w_1$ and $w_2$, respectively. The
link-structure-based concept-relatedness measure, link-rel: $W \times W \to [0,1]$, is then given
by

$$\text{link-rel}(w_1, w_2) = \frac{\log(\max(|W_1|, |W_2|)) - \log(|W_1 \cap W_2|)}{\log(|W|) - \log(\min(|W_1|, |W_2|))}$$



Remark 3.1.3. Although this link-based relatedness measure is defined only
for uniquely identified Wikipedia-concepts, it can be extended for calculating
relatedness between any given pair of n-grams by using our knowledge about
redirects and anchors attached to different concepts.



The underlying principle of link-rel is rather simple: if two articles share
many of the same links, then they are likely to be highly related. For example,
if we consider two major investment banks, such as "Goldman Sachs" and
"Morgan Stanley", link-rel yields a relatedness score of almost 80 percent
due to the large number of financial concepts shared by both bank-articles,
whereas "Goldman Sachs" and "Football" receive a far lower score. Of course,
these results are sensitive to the quality of the concept-articles' link-structure, 
and can thereby vary depending on the version of the Wikipedia being used. 
Nevertheless, when well-established articles are considered, and when speed 
is essential, we find that this kind of graph-based approach has proven to be 
a reasonably reliable way of measuring proximity between any arbitrary pair 
of concepts. 
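
The computation behind Definition 3.1.2 can be sketched in a few lines. Read literally, the ratio in the definition is small when the two in-link sets overlap heavily, as is the case for the Normalized Google Distance. The sketch therefore assumes, and this is an assumption not stated in the text, that the reported relatedness is one minus that ratio, clipped to [0, 1]. The function names and the toy in-link sets are hypothetical.

```python
import math


def link_distance(in_links_1, in_links_2, total_articles):
    """Evaluate the ratio of Definition 3.1.2 for two sets of in-linking articles.

    Returns math.inf when the sets share no article, since log(0) is undefined.
    """
    a, b = set(in_links_1), set(in_links_2)
    shared = a & b
    if not shared:
        return math.inf
    numerator = math.log(max(len(a), len(b))) - math.log(len(shared))
    denominator = math.log(total_articles) - math.log(min(len(a), len(b)))
    if denominator <= 0:
        raise ValueError("total_articles must exceed the size of the smaller set")
    return numerator / denominator


def link_rel(in_links_1, in_links_2, total_articles):
    """Assumed conversion to a relatedness score: 1 - distance, clipped to [0, 1]."""
    distance = link_distance(in_links_1, in_links_2, total_articles)
    return min(1.0, max(0.0, 1.0 - distance))


# Toy in-link sets (article identifiers are arbitrary integers).
same = link_rel({1, 2, 3, 4}, {1, 2, 3, 4}, total_articles=1000)
partial = link_rel({1, 2, 3, 4}, {3, 4, 5, 6}, total_articles=1000)
disjoint = link_rel({1, 2}, {3, 4}, total_articles=1000)
```

Only set sizes and one intersection are needed per pair, which is consistent with the remark above that the measure is computationally cheap.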

So far, we have considered the computation of semantic relatedness in its 
conventional setup between two concepts. However, given our intention to use 
Wikipedia's relatedness information in matching queries and documents, it is 
perhaps more relevant to ask: how can we measure the relatedness between 
a document and a given concept? Or how likely is it for the given concept to
appear in the document? For this purpose, we propose the following simple
extension of the link-relatedness measure.

Definition 3.1.4 (Document-concept relatedness). Let $w \in W$ denote
any Wikipedia-concept, and $d \in D$ be an active document. The Wikipedia-
based document-term relatedness measure, d-rel: $W \times D \to [0,1]$, is given
by

$$\text{d-rel}(w, d) = \max\{\text{link-rel}(w, w') : w' \in A(d)\}$$

where the document model, $A(d)$, is interpreted as the collection of Wikipedia-
concepts detected in document $d$; see Section 4 for further discussion on
document modelling.

Here, the use of the maximum rather than a sum-based operator such as the average is
a deliberate choice. Since d-rel is intended to be used in evaluating whether a
document matches a given query, we do not want to allow any sum-operations 
to mask the presence of those concepts in a document which are not related to 
its central theme. To illustrate the idea, consider a single-concept query for 
documents on "Industrial espionage". Now, suppose that we receive a large 
document on car manufacturing, where most of the discussion is concerned 
with general economics and car models. However, the document still has a 
single paragraph on stolen trade secrets and car-prototype specifications. In 
order to prevent the document's main theme from hiding its relatedness to 
industrial spying, we choose to measure the relatedness by using the concept 
that is best associated with espionage. In this particular case, because trade 
secret is strongly linked to industrial espionage, it is natural to use their 
association to evaluate the overall relatedness between the document and 
the given query. 
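
The effect of choosing the maximum can be checked with a small sketch. The pairwise scores below are invented for illustration and are not computed from Wikipedia; `toy_link_rel`, `d_rel` and `mean_rel` are hypothetical names, and the averaging variant is included only for comparison.

```python
from statistics import mean

# Hypothetical link-rel scores between the query concept and document concepts.
TOY_LINK_REL = {
    ("Industrial espionage", "Trade secret"): 0.8,
    ("Industrial espionage", "Automobile"): 0.1,
    ("Industrial espionage", "Economics"): 0.2,
}


def toy_link_rel(w1, w2):
    """Symmetric lookup in the toy table; unknown pairs score 0."""
    if w1 == w2:
        return 1.0
    return TOY_LINK_REL.get((w1, w2), TOY_LINK_REL.get((w2, w1), 0.0))


def d_rel(concept, doc_concepts, link_rel=toy_link_rel):
    """Document-concept relatedness: the best score over detected concepts."""
    return max((link_rel(concept, w) for w in doc_concepts), default=0.0)


def mean_rel(concept, doc_concepts, link_rel=toy_link_rel):
    """Averaging variant, shown only to contrast with the maximum."""
    scores = [link_rel(concept, w) for w in doc_concepts]
    return mean(scores) if scores else 0.0


# A car-manufacturing document with one passage on stolen trade secrets.
car_document = ["Automobile", "Economics", "Trade secret"]
by_max = d_rel("Industrial espionage", car_document)
by_mean = mean_rel("Industrial espionage", car_document)
```

With these toy numbers the maximum keeps the strong "Trade secret" association, while the average is pulled down by the concepts that make up the document's main theme.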

3.2. Query learning problem 

The demand for automated query learning is driven by the difficulty of 
formulating effective queries that match the user's information needs. Find- 
ing appropriate search terms and conditions is generally hard even for expert 
users. Therefore, given a certain topic, the task of query learning systems 
is to help the user to find a query definition with improved precision and 
recall. As the size of the world's information base is growing at a staggering
rate, the problem is becoming increasingly pressing. To alleviate it, a large
number of competing solutions for query formulation have been proposed.
As suggested by Cordon et al., these can roughly be cate-
gorized into three baskets: (1) term learning; (2) weight learning; and (3) 
query-structure learning. 

The commonality of the approaches is their reliance on some form of 
relevance feedback, where the system elicits (possibly iteratively) a set of 
feedback statements from the user. In the first two model categories, rele- 
vance feedback is used for modifying the user's previous query by removing or 
adding terms and adjusting their weights to better reflect the user's relevance 
judgements. For example, many of the probabilistic models and document- 
vector modification models belong to these categories; see e.g. Salton and
Buckley [41] and Rocchio [40], Yang and Korfhage [47], Horng and Yeh [18],
and Boughanem et al.

Our focus is on the third category, query-structure learning, which takes 
the learning process one step further in the context of boolean or fuzzy 
boolean queries. It not only attempts to infer the terms that are most ap- 
propriate for representing a given query but also tries to learn the query's 
structure, i.e. it determines how the boolean operators AND ($\land$), OR ($\lor$),
and NOT ($\lnot$) should be used to join the different concepts. In many texts,
query learning is considered as a reserved term for representing this third
type of query definition, where both the functional form and concepts of
the query are free variables; see e.g. Cordon et al., Lopez-Herrera et
al. [22], [21] and their references. The IQBE paradigm (Section 3.3) and the
Wiki-ES system introduced in this paper are mainly viewed as structural
query learning models. Therefore, for the rest of this paper, we will use the
following general definition to refer to the query learning problem.

Definition 3.2.1 (Query learning problem). Let $C$ be a set of admissi-
ble concepts, and let $Q$ denote the space of all admissible queries which can
be formed using concepts in $C$. The query learning task is to find that boolean
expression from the set $Q$ which best represents the user's information needs
by applying the following syntactic rules:

1. Atomic query (single concept): $\forall q = c_i \in C \Rightarrow q \in Q$

2. Composition using AND: $\forall q, p \in Q \Rightarrow q \land p \in Q$

3. Composition using OR: $\forall q, p \in Q \Rightarrow q \lor p \in Q$

4. Negation: $\forall q \in Q \Rightarrow \lnot q \in Q$

The space of admissible queries $Q$ consists of all the queries obtained by
applying the above set of rules.

There are many ways to approach the above problem - both with and without
the use of semantic knowledge. At this stage, we notice that the definition
remains deliberately abstract by not specifying how the set of concepts should
be understood and how the learnt queries should be matched against documents. Of
course, when classical boolean queries using the bag-of-words approach are
considered, the answer is quite straightforward. However, when the atoms
of a query are uniquely defined concepts, it is no longer self-evident how the
query should be evaluated. In fact, as we find out in the Wiki-ES model, the
performance differences between concept-based and word-based approaches 
follow from the way concept-relationship information is used while matching 
documents with learnt queries. 
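
The four syntactic rules of the query learning problem map directly onto a small recursive data type. The sketch below is one possible encoding; the class names and the `evaluate` function are assumptions made for this example. Atoms are matched by exact set membership, which corresponds to the straightforward bag-of-words style of evaluation and deliberately leaves out any use of concept-relatedness.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass(frozen=True)
class Concept:          # rule 1: atomic query
    name: str


@dataclass(frozen=True)
class And:              # rule 2: composition using AND
    left: "Query"
    right: "Query"


@dataclass(frozen=True)
class Or:               # rule 3: composition using OR
    left: "Query"
    right: "Query"


@dataclass(frozen=True)
class Not:              # rule 4: negation
    operand: "Query"


Query = Union[Concept, And, Or, Not]


def evaluate(query: Query, doc_concepts: Set[str]) -> bool:
    """Evaluate a query tree against the concepts detected in a document."""
    if isinstance(query, Concept):
        return query.name in doc_concepts   # simplifying assumption: exact match
    if isinstance(query, And):
        return evaluate(query.left, doc_concepts) and evaluate(query.right, doc_concepts)
    if isinstance(query, Or):
        return evaluate(query.left, doc_concepts) or evaluate(query.right, doc_concepts)
    if isinstance(query, Not):
        return not evaluate(query.operand, doc_concepts)
    raise TypeError(f"unsupported query node: {query!r}")


# Espionage AND NOT Military
example_query = And(Concept("Espionage"), Not(Concept("Military")))
```

Replacing the exact-match test in the `Concept` branch with a relatedness-based test is the point at which a concept-based evaluation would depart from this sketch.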

3.3. IQBE - Inductive Query By Example 

One of the best known bag-of-words based methods for solving the query
learning problem (Definition 3.2.1) is the Inductive Query By Example (IQBE) framework
originated by Smith and Smith [42] and Chen et al. [5]. The idea behind the
IQBE paradigm is in principle very similar to relevance feedback; both of
them require explicit relevance statements from the user to guide the retrieval
process. In IQBE, the user provides the system with a collection of sample
documents (positive/negative examples) from which an algorithm learns the
terms and the boolean operators joining them, such that the obtained query
best represents the user's information need. However, instead of modifying an
existing query iteratively, the system performs only a single run to generate
a fresh query from scratch. Once the learnt query is available, it can be
executed on any boolean information retrieval system (IRS). Such portability
of queries can be considered as one of the characteristics that distinguishes
IQBE systems from general relevance feedback. In descriptions of the IQBE
architecture, this is commonly emphasized by presenting the IQBE system as a
separate unit outside the IRS; see Lopez-Herrera et al. [20] and Figure 2 for
descriptions of a general IQBE system.

Figure 2: A general IQBE architecture

In the IQBE framework, the query learning task is viewed as a large optimiza-
tion problem, where the search space consists of all possible queries that can
be presented to the IRS. Therefore, recognizing the high dimensionality of
this problem, it is no surprise that the IQBE approaches usually rely on some
form of evolutionary computation. In particular, following the early studies
by Kraft et al. and Smith and Smith [42], genetic programming [19] has
gained ground as a robust choice for query learning. Recently, a number
of frameworks based on multiobjective genetic programming have also been
examined. Due to the fact that the performance of an IRS is mostly eval-
uated in terms of precision and recall, it appears natural to consider query
learning as an inherently multiobjective problem. For interesting applica-
tions of multiobjective evolutionary algorithms, see e.g. Cordon et al.
and Lopez-Herrera et al. [22], [21].
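
For reference, the two objectives can be computed as follows. This is a minimal sketch using the standard definitions of precision and recall; the function name and the convention of returning 0.0 for an empty set are choices made for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# A query retrieves documents 1-4; documents 2, 4 and 5 are actually relevant.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5})
```

A broader query tends to raise recall at the cost of precision and a narrower one does the opposite, which is why the two measures are naturally treated as separate objectives.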



As discussed by Tamine et al. [46], the popularity of evolutionary algo-
rithms is largely explained by their implicit parallelism, which allows them to
search different regions of the solution space simultaneously. It is also argued
that evolutionary algorithms are less sensitive to the quality of the initial
query, whereas classical relevance feedback methods, such as Rocchio [40],
perform poorly if the initial query fails to retrieve relevant documents. The
probabilistic exploration induced by evolutionary algorithms permits them
to search unexplored areas independent of the initial query. Hence, the
use of evolutionary algorithms is a well-justified choice for query learning,
as non-expert users can rarely find a good query on a first try when more
complicated topics are considered.

Although the automated query learning problem has stimulated a lot of
interest over the past few years, it is noteworthy that the majority of the de-
velopment has concentrated on improving learning algorithms rather than
examining the role of query and document representations. However, recog-
nizing the fact that the use of semantic information has transformed many
natural language processing applications [48], we consider it worthwhile to
work towards the development of a Wikipedia-concept based approach which
would enhance automated query learning.

4. Wiki-ES: Learning semantic queries with Wikipedia 

In this section, we present the Wiki-ES (Wikipedia-based Evolutionary 
Semantics) framework for automated query learning. The approach is based 
on the Genetic Programming (GP) paradigm, which is a potent tool in arti- 
ficial intelligence for performing program induction. In GP, the idea is to use 
the principles of evolutionary computation to intelligently search the space 
of possible computer programs for finding an individual that is highly fit for 
solving the problem at hand. In effect, one could say that the purpose is to
get the machine to generate a solution to the problem without being explic-
itly programmed [19]. For example, in our case we want the Wiki-ES system
to learn a program (i.e. query) that leads to recovery of a high number of
relevant documents while keeping the irrelevant documents out. The learn-
ing process is driven by evolutionary pressure, which ensures that only
the fittest individuals among all potential query candidates survive.

4.1. Wiki-ES framework overview

A bird's-eye view of the Wiki-ES framework resembles the architecture of
the IQBE paradigm (see Figure 2), where the idea is that the system is able
to learn an optimal query by using just a small set of sample documents that 
represent the user's current topic or information need. On the surface, this 
sounds simple. However, when examining the steps involved in the learning 
process, it becomes clear that a number of choices, ranging from the choice 
of query and document models to the choice of the genetic procedure, have 
large impacts on the outcome. 

To illustrate the way Wiki-ES approaches the query learning problem, let 
us consider an example where a user seeks to define a query that picks up 
all the documents on economic espionage but ignores the ones on politics or 
military espionage. Then, we can split the Wiki-ES process into the following 
steps (see Figure 3):

1. Training data generation: Suppose that the user has already found a
set of documents that she considers highly relevant for the topic and
also a collection of documents that are concerned with espionage but
are more about military spying than industrial espionage. Then, the
training data set is defined as a relevance matrix, where each sample
document is given a boolean value to represent its relevance for the
topic (1 = relevant, 0 = irrelevant).

2. Learning an intelligent query, i.e. the Wiki-ES rule: In the learning 
step, the training data set is given to the GP-algorithm to find an 
optimal Wiki-ES rule to describe the topic. Each Wiki-ES rule consists 
of a number of queries, which allows the rule to take into account not 
only the concepts which appear directly in the query-expressions but 
also the ones which are strongly related to them. A detailed description
of the rules is given in Section 4.3. The GP-algorithm is described in
Section 5.

3. Feeding the Wiki-ES rule and documents to the Wiki-based Information 
Retrieval System (WIRS): Once the optimal Wiki-ES rule is known, it 
can be given to a matching subsystem which evaluates the query against 
the incoming documents. In the Wiki-ES framework, this task is handled
by the WIRS module, which consists of two subsystems: the document
modeling subsystem and the rule-matching subsystem.

(a) Document modeling subsystem: Before the incoming documents 
can be matched against the Wiki-ES rules, they are passed through 
a wikifier and a named-entity recognizer (NER). The resulting pro- 
file, expressed in terms of the identified Wikipedia concepts and 
named-entities, can then be used to represent the document
contents when matching against Wiki-ES rules; see Section 4.2 for a
description of the document model.

(b) Rule-matching subsystem: The rule-evaluator in the WIRS module
provides a matching subsystem for deciding whether a given
document matches the currently active semantic rule or not. In the
Wiki-ES framework, it is hence the responsibility of the rule-evaluator
to utilise Wikipedia's concept-relatedness information while
determining whether the query concepts are present in the active
document, either directly or indirectly. The operation of the
rule-evaluator is described in Section 4.3.

4. Returning the filtered documents to the user: The documents that are
found to match the active Wiki-ES rule are returned to the user. If
the user is satisfied with the retrieved documents, then the process 
terminates. Otherwise, a new training data set is created using the 
final documents and the initial matching documents, and the system 
returns to step 1. 

Figure 3: Wiki-ES flowchart

Having provided a rough schematic overview of the Wiki-ES model, we 
are now ready to explain more closely the underpinnings of the Wiki-based 
Information Retrieval System (WIRS). The remainder of this section is
organized as follows. First, in Section 4.2 we discuss the Wikipedia-based content
model used within the document modeling subsystem of WIRS. Then,
Section 4.3 continues by outlining the structure of the rule-matching
subsystem in WIRS. In particular, we define what Wiki-ES rules are and how
they are evaluated. Finally, we formalise the Wiki-ES learning problem in
Section 4.4. The details of the learning algorithm are treated separately in
Section 5.

4.2. Wikipedia-based document model

In the Wiki-ES framework, each document is represented by a collection of
Wikipedia-concepts that are identified from its contents. The approach
builds on the wikification technique proposed by Milne et al. [30] and Medelyan
et al. [26], where a two-stage classifier is utilised to recognize those terms in
the document which should act as Wikipedia-concepts. However, the model
employed here extends the wikification process by splitting the found
concepts into two categories, general Wikipedia-concepts and named-entity
concepts, using a named-entity recognizer.

To explain the rationale for this modification, consider, for example, a
named entity "Goldman Sachs" and a general concept "Investment
banking". Now, to say that a certain document discusses Goldman Sachs requires
that the bank's name is explicitly mentioned. On the other hand, if we say 
that a document is about investment banking, it is sufficient to find a collec- 
tion of investment banking related concepts rather than the exact concept 
name to identify the document as relevant. Clearly, the different nature of 
general concepts and named-entities should be taken into account when spec- 
ifying the sensitivity of the Wiki-ES model to different concept types. Hence, 
in Wiki-ES, each document is interpreted as a pair of two collections: the 
named-entities and other Wikipedia concepts. 

Definition 4.2.1 (Wiki-ES document model). Let $D$ be the space of
documents, and let $W$ denote the collection of Wikipedia-concepts. The document
model is defined as the mapping

\[ A : d \in D \mapsto (N_d, G_d) \subset W \times W, \]

where $N_d$ and $G_d$ denote the sets of named-entities and general
Wikipedia-concepts found in the document $d$.

So, for example, if a document $d \in D$ contains the Wikipedia-concepts {Investment
banking, Goldman Sachs, Morgan Stanley, Mortgage, Credit}, we
simply present the document model as split into two parts, $A(d) = (N_d, G_d)$,
where $N_d$ = {Goldman Sachs, Morgan Stanley} and $G_d$ = {Investment banking,
Mortgage, Credit}.

The document model A is implemented in two stages: wikification and 
named-entity recognition. Once the usual preprocessing steps have been 
carried out, the first stage is to identify all Wikipedia-concepts present in the 
document. At this stage, no separation between named-entities and general 
concepts is made. Here, identification is done using the wikification (or
cross-referencing) technique proposed by Milne et al. [30], where a sequence of two
classifiers is run to detect which terms should be linked to Wikipedia and
which Wikipedia-concepts they correspond to.

The second step, named-entity recognition (NER), is done by using the 
Conditional Random Fields (CRF)-based classifier proposed by Finkel et 
al. jof. An advantage of this model is that the system is able to augment 
non-local information which allows construction of long-distance dependency 
models and enforcement of label consistency. In this approach, the set of 
Wikipedia-named-entities is identified by examining the overlap between the
terms that have been picked up by the wikification step and those recognized
by the NER-classifier.
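
The overlap step of the document model can be illustrated with a minimal Python sketch. It assumes that the wikifier and the NER classifier have already been run and that their outputs are available as plain sets of strings; the function name `document_model` and its argument names are hypothetical and are not part of the original system.

```python
def document_model(wikified_concepts, ner_mentions):
    """Split wikified concepts into (named entities N_d, general concepts G_d).

    wikified_concepts: set of Wikipedia-concepts found by the wikification step.
    ner_mentions: set of terms recognized by the NER classifier.
    Both inputs are assumed to be precomputed; only the overlap step is shown.
    """
    named_entities = wikified_concepts & ner_mentions    # N_d: found by both steps
    general_concepts = wikified_concepts - ner_mentions  # G_d: remaining concepts
    return named_entities, general_concepts


concepts = {"Investment banking", "Goldman Sachs", "Morgan Stanley",
            "Mortgage", "Credit"}
mentions = {"Goldman Sachs", "Morgan Stanley", "New York"}  # illustrative NER output
N_d, G_d = document_model(concepts, mentions)
```

In this sketch a term recognized only by the NER classifier (here "New York") is discarded, since it was not linked to a Wikipedia-concept in the first stage.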

4.3. Query model: structure and matching of Wiki-ES rules

As mentioned in Section 4.1, each Wiki-ES rule can be viewed as a
composition of a number of queries. The Wiki-ES rule has an underlying structure
that is essentially different from what is seen in ordinary boolean queries.
To provide a more accurate picture, we formalise the definition of a Wiki-ES
rule as a voting system where several concept based queries cast
votes and the weighted sum of their votes is taken to represent the relevance of a
document.

The presentation of the Wiki-ES model is structured as follows. First,
we define the Wiki-queries that are used as building blocks in a Wiki-ES rule
(Section 4.3.1). Thereafter, in Section 4.3.2, we introduce a fitness-measure
for evaluating the quality of individual queries, and discuss how a voting
system can be used to combine the output of several Wiki-queries to generate
a Wiki-ES rule. Section 4.4 summarizes the Wiki-ES learning problem. We
also discuss the benefits of constructing the Wiki-ES rule as a voting system
instead of using the individual queries directly.

4.3.1. Building blocks of Wiki-ES rules

Now, we begin by outlining the types of boolean queries used as building 
blocks for the Wiki-ES rule. To distinguish these from ordinary term-based 
queries, we refer to them as Wiki-queries (concept-based queries) hereafter. 
Unlike an ordinary boolean query, a Wiki-query consists of two parts. In 
addition to the query-expression, each Wiki-query also contains a specialized 
evaluator function which allows the query to utilize Wikipedia's concept- 
relatedness information when it is matched against documents. 

Definition 4.3.1 (Wiki-query). A Wiki-query $q : D \to \{0,1\}$ is defined
by a pair $(e, \delta)$, where

(i) the first component, $e$, is an ordinary query-expression that is defined in
terms of Wikipedia-concepts $V \subset W$ and the standard boolean operators
by following the syntactic rules outlined in 3.2.1; and

(ii) the second component, $\delta : V \times D \to \{0,1\}$, is a concept-evaluator
function given by Definition 4.3.3, which determines whether a concept $v \in V$ is
present in any given document $d \in D$.

When matching the given query $q = (e, \delta)$ against any document $d$, the value
of the query $q(d)$ is obtained by replacing each concept $v \in V$ in the query
expression $e$ with the corresponding value $\delta(v, d)$ given by the concept-evaluator.

Example 4.3.2. Let $q$ be defined by $(e, \delta)$. If $e = v_1 r_1 v_2 r_2 \cdots r_{k-1} v_k$, and
$v_i \in W$, $r_i \in \{\wedge, \vee, \neg\}$ for all $i = 1, \ldots, k$, then the value of the query
amounts to $q(d) = \delta(v_1, d) r_1 \delta(v_2, d) r_2 \cdots r_{k-1} \delta(v_k, d)$.

Definition 4.3.3 (Concept-evaluator). The concept-evaluator function,
$\delta : V \times D \to \{0,1\}$, whose purpose is to account for Wikipedia's
concept-relatedness information when evaluating Wiki-queries, is given by

\[
\delta(v, d) =
\begin{cases}
1 & \text{if } v \in A(d), \\
1 & \text{if } v \in Rel(d), \\
0 & \text{otherwise,}
\end{cases}
\]

where

\[ Rel(d) = \{ v \in V : \text{d-rel}(v, d) > c_{rel}(v) \}, \]

and $c_{rel} > 0$ is a threshold function controlling the acceptance sensitivity by
relatedness criteria. The threshold for the document-concept relatedness function
(d-rel) depends on the type of concept, i.e. whether it is a named-entity or
a general Wikipedia-article. If $A(d) = (N_d, G_d)$, we have

\[
c_{rel}(v) =
\begin{cases}
c_1 & \text{if } v \in N_d, \\
c_2 & \text{if } v \in G_d.
\end{cases}
\]

Each sensitivity threshold is chosen based on training data. The purpose
of the distinction between named-entities and general concepts is to allow
stricter thresholds for named-entities, which have narrower definitions than
general concepts.

To illustrate the underlying idea, consider a simple Wiki-query, $q = (e, \delta)$,
where the query-expression

\[ e = \text{Lawsuit} \wedge (\text{Espionage} \vee \text{TradeSecret}) \wedge \text{BMW} \]

requests documents on industrial espionage that are concerned with BMW.
Now, suppose that the following document $d$ is received:

    A civil court in Hamburg will give its verdict on Tuesday on a
    hearing called by Spiegel, a leading German magazine. Spiegel is
    trying to lift an injunction from VW preventing it from repeating
    allegations of corporate spying against Mr Lopez... The documents
    include top-secret details of Opel's new small car project, coded
    the 0-car, which is to rival Volkswagen's planned Chico.

Once the document has been profiled, it can be evaluated against the query
expression. In this case, during the concept-evaluation step, we find that
$\delta(\text{Lawsuit}, d) = 1$ because the terms "civil court" and "allegation" point to
Lawsuit, and similarly we have $\delta(\text{Espionage}, d) = 1$ because "spying" is a
redirect to Espionage. However, the evaluation of the concept TradeSecret
and the named-entity concept BMW turns out to be more problematic, as
it depends on the acceptance-sensitivity function ($c_{rel}$).

Let us first consider the TradeSecret-concept. To determine whether
TradeSecret is present in the document, we need to examine its relatedness
to other concepts that have been identified from the document. In the above
excerpt "top-secret" is recognized as ClassifiedInformation, which is strongly
related to TradeSecret; therefore the decision boils down to the comparison
of these two concepts. Here, $\delta(\text{TradeSecret}, d)$ equals 1 only if the
acceptance sensitivity $c_{rel}(\text{TradeSecret})$ is less than the link-relatedness measure
between TradeSecret and ClassifiedInformation.

So far, it seems that the given excerpt is almost a match provided that the
last concept, BMW, is also recognized as related to the document. Now, the
acceptance sensitivity parameter for named-entities, $c_1$, is set at a reasonably
strict level, say 0.95, to ensure that named-entities are not as broadly defined
as the general concepts. For example, one would observe a very high
relatedness between BMW and VW as they are both German car manufacturers
with very similar link-structures. However, mixing these two would be a
serious error from the user's point of view. Therefore, being able to define
acceptance sensitivities separately for named-entities and general
Wikipedia-concepts proves to be a useful tool. Eventually, due to the high value of $c_1$, we
deduce that $\delta(\text{BMW}, d) = 0$, and therefore the document is considered to be
irrelevant.
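
The BMW example can be traced with a small Python sketch of the concept-evaluator and of Wiki-query matching. Several parts are assumptions made only for illustration: the document-concept relatedness d-rel is taken as the maximum pairwise relatedness between the query concept and the document concepts, the relatedness scores and the threshold `c2 = 0.6` are invented values, and the set `entity_concepts` marking which query concepts are named entities is a hypothetical input. Only `c1 = 0.95` comes from the text.

```python
def d_rel(concept, doc_concepts, relatedness):
    """Document-concept relatedness, taken here as the best pairwise score.

    Assumption: the maximum pairwise relatedness is used for illustration.
    """
    return max((relatedness.get(frozenset((concept, c)), 0.0)
                for c in doc_concepts), default=0.0)


def delta(concept, doc, relatedness, entity_concepts, c1=0.95, c2=0.6):
    """Concept-evaluator: 1 if the concept is present directly or via relatedness."""
    named, general = doc
    present = named | general
    if concept in present:
        return 1
    # Stricter threshold c1 for named entities, looser c2 for general concepts.
    threshold = c1 if concept in entity_concepts else c2
    return 1 if d_rel(concept, present, relatedness) > threshold else 0


def evaluate(expr, doc, relatedness, entity_concepts):
    """Evaluate a query-expression given as nested tuples or a concept string."""
    if isinstance(expr, str):
        return delta(expr, doc, relatedness, entity_concepts)
    op, *args = expr
    values = [evaluate(a, doc, relatedness, entity_concepts) for a in args]
    if op == "AND":
        return int(all(values))
    if op == "OR":
        return int(any(values))
    if op == "NOT":
        return 1 - values[0]
    raise ValueError(f"unknown operator: {op}")


# Profile of the excerpt: (named entities, general concepts).
doc = ({"Volkswagen", "Opel", "Der Spiegel"},
       {"Lawsuit", "Espionage", "ClassifiedInformation"})
relatedness = {  # illustrative scores, not measured values
    frozenset(("TradeSecret", "ClassifiedInformation")): 0.7,
    frozenset(("BMW", "Volkswagen")): 0.9,
}
query = ("AND", ("AND", "Lawsuit", ("OR", "Espionage", "TradeSecret")), "BMW")
result = evaluate(query, doc, relatedness, {"BMW"})
```

With these illustrative numbers TradeSecret is accepted through its relatedness to ClassifiedInformation, whereas BMW is rejected because its relatedness to Volkswagen stays below the strict named-entity threshold, so the query rejects the document.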

4.3.2. Wiki-ES rule 

Having introduced Wiki-queries, we are now ready to explain how they 
are combined to generate a Wiki-ES rule. For this purpose, we define two 
additional functions: (i) a fitness-function for measuring the quality of
individual Wiki-queries; and (ii) a voting function for summarizing the output
of a group of Wiki-queries into a single measure.

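Definition 4.3.4 (Fitness of Wiki-query). Let $Q$ denote the space of
admissible Wiki-queries. The fitness-function for a Wiki-query $q \in Q$ is defined
as the mapping $F : (q, D_t) \mapsto c \in [0,1]$, which corresponds to the F-score
within a given set of evaluation documents $D_t \subset D$:

\[ F(q, D_t) = \frac{2 \, P(q, D_t) \, R(q, D_t)}{P(q, D_t) + R(q, D_t)}, \]

where $P(q, D_t)$ is the precision of the query in the document set $D_t$, and
$R(q, D_t)$ is the recall of the query, respectively. By denoting the relevance of
a document $d \in D_t$ by $r(d) \in \{0, 1\}$, precision and recall are defined as

\[ P(q, D_t) = \frac{\sum_{d \in D_t} r(d) \, q(d)}{\sum_{d \in D_t} q(d)}, \qquad
   R(q, D_t) = \frac{\sum_{d \in D_t} r(d) \, q(d)}{\sum_{d \in D_t} r(d)}. \]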
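
For binary query outputs, precision is the share of retrieved documents that are relevant and recall is the share of relevant documents that are retrieved. A minimal Python sketch of the resulting fitness computation is given below; the function name `f_score` is hypothetical, and returning 0 when a denominator vanishes is a convention assumed here rather than one stated in the text.

```python
def f_score(predictions, relevance):
    """F-score of binary predictions q(d) against relevance labels r(d).

    Both arguments are equal-length sequences of 0/1 values over D_t.
    Returning 0.0 when a denominator vanishes is an assumed convention.
    """
    hits = sum(p * r for p, r in zip(predictions, relevance))
    retrieved = sum(predictions)
    relevant = sum(relevance)
    if hits == 0 or retrieved == 0 or relevant == 0:
        return 0.0
    precision = hits / retrieved
    recall = hits / relevant
    return 2 * precision * recall / (precision + recall)


# Five training documents: the query retrieves three, two of which are relevant.
score = f_score([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
```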

Now, suppose that instead of having a single query to describe the user's 
information need, we have several complementary queries for the same topic, 
where each query represents a part of the user's need. In order to benefit 
from the diversity provided by the multiple query representation, we first 
need to resolve how the potentially conflicting results from different queries 
can be combined into a single document-relevance measure. Given the above 
F-score as a fitness-measure for evaluating the quality of each individual 
Wiki-query, a natural approach for dealing with this "query fusion" problem 
is to consider the following voting function where each query contributes to 
the overall relevance judgement according to its relative fitness: 

Definition 4.3.5 (Voting function). Let $\mathcal{A} \subset Q$ be a finite collection of
Wiki-queries. A voting function $\mu_{\mathcal{A}} : D \to [0, 1]$ is given by

\[ \mu_{\mathcal{A}}(d) = \frac{\sum_{q_i \in \mathcal{A}} F_i \, q_i(d)}{\sum_{q_i \in \mathcal{A}} F_i}, \]

where $F_i = F(q_i, D_t)$ is the fitness of query $q_i$ evaluated with respect to a
training document set $D_t \subset D$.

Remark 4.3.6. The voting function $\mu_{\mathcal{A}}$ can also be used for ranking the
documents based on their relevance to the given topic. However, the use of
rank-order information is left as a direction for further research.

The value of the voting function has an interpretation as the joint-relevance 
of a document, where the judgement is based on several alternative queries 
that describe the given topic. If the value of the voting function is greater 
than 0.5, then the document is considered relevant, otherwise it is considered 
irrelevant. Using this weighted contribution, the information from several 
queries is taken into account, which helps to reduce the risk of overfitting the 
training document set with a single query. This discussion is formalised by 
the following definition of the Wiki-ES rule. 

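Definition 4.3.7 (Wiki-ES rule). Let $Q$ denote the space of admissible
boolean queries formed using Wikipedia-concepts, and let $\mu_{\mathcal{A}}$ be a voting
function that evaluates the document-relevance based on a finite set of
Wiki-queries, $\mathcal{A} \subset Q$. Now, the Wiki-ES rule is defined as the function
$q_{\mathcal{A}} : D \to \{0,1\}$,

\[
q_{\mathcal{A}}(d) =
\begin{cases}
1 & \text{if } \mu_{\mathcal{A}}(d) > 0.5, \\
0 & \text{otherwise,}
\end{cases}
\]

and the space of admissible Wiki-ES rules is given by $\mathcal{Q} = \{ q_{\mathcal{A}} \mid \mathcal{A} \subset Q \}$,
where $\mathcal{A}$ denotes any finite set of Wiki-queries.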

Remark 4.3.8. At this point, it is worthwhile to note that any Wiki-query
can be viewed as a Wiki-ES rule, i.e. $Q \subset \mathcal{Q}$, because for every Wiki-query
$q_0 \in Q$, we have $q_{\{q_0\}} \in \mathcal{Q}$. Hence, the Wiki-ES rules provide a natural
extension of the Wiki-queries.
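
The fitness-weighted vote and the 0.5 cut-off can be sketched in Python as follows. The function names `vote` and `wiki_es_rule` are hypothetical, the fitness values are illustrative, and the handling of the case where all fitness values are zero is an assumed convention.

```python
def vote(outputs, fitnesses):
    """Fitness-weighted share of the queries that accept the document."""
    total = sum(fitnesses)
    if total == 0:
        return 0.0  # assumed convention when every query has zero fitness
    return sum(f * b for f, b in zip(fitnesses, outputs)) / total


def wiki_es_rule(outputs, fitnesses, cutoff=0.5):
    """Return 1 if the weighted vote exceeds the cutoff, else 0."""
    return 1 if vote(outputs, fitnesses) > cutoff else 0


fitnesses = [0.8, 0.6, 0.4]                    # illustrative F-scores of three queries
accepted = wiki_es_rule([1, 0, 1], fitnesses)  # vote = 1.2 / 1.8
rejected = wiki_es_rule([0, 1, 0], fitnesses)  # vote = 0.6 / 1.8
single = wiki_es_rule([1], [0.7])              # one query: the rule equals the query
```

The last line mirrors Remark 4.3.8: with a single query of positive fitness, the vote reduces to the output of that query.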

4.4. Wiki-ES as an optimization problem

As discussed in Section 3.3, the query learning task can be viewed as a
large optimization problem, where the search space consists of all possible 
queries that can be presented to the IRS. However, instead of considering 
optimization over the space of admissible Wiki-queries, we convert the query 
learning task into the problem of finding an optimal Wiki-ES rule which 
maximizes F-score with respect to the given collection of training documents. 



Definition 4.4.1 (Wiki-ES learning problem). Let $D_t \subset D$ be the set
of training documents for which the user has given relevance statements, and let
$\mathcal{Q}$ denote the space of Wiki-ES rules. The learning problem is given by

\[ q^* = \arg\max_{q \in \mathcal{Q}} F(q, D_t), \]

where $F : (q, D_t) \mapsto c \in [0, 1]$ is the Wiki-ES fitness function, which
corresponds to the F-score within the training document set $D_t$; see
Definition 4.3.4.



The rationale for defining the learning problem in terms of Wiki-ES rules 
instead of Wiki-queries stems from the following reasons. The first one is the 
multimodality of the user's relevance function. As pointed out by Tamine
et al. [46], the relevant documents corresponding to the same topic can be
dispersed into different regions of the document space, and thereby have quite 
different profiles. This implies that in order to recover the relevant documents 
it is necessary to explore the document space in a number of directions at 
the same time. Therefore, given the definition of a Wiki-ES rule as a voting 
system, it appears to be a natural solution for the multimodality problem as 
it utilises a number of Wiki-queries while making the retrieval decisions. 

The use of Wiki-ES is also motivated by the fact that unlike classical 
methods, GP-based approaches always operate with a population of queries 
rather than a single query. Therefore, we are likely to obtain better results by 
using several individuals from the population to represent the solution rather 
than rely on a single query candidate. Hence in order to solve the above 
optimization problem, we have chosen to use a co-evolutionary GP approach, 
where multiple subpopulations are evolved simultaneously to produce Wiki- 
queries that can be combined to produce an optimal Wiki-ES rule. The 
details of the algorithm are provided in Section 5.

5. Wiki-ES GP-algorithm 

The aim of the proposed GP algorithm is to generate better fit queries
using a mechanism inspired by biological evolution [35]. The approach is
population based, where each individual represents a Wiki-query. The idea
behind the technique is that, for a given population of individuals,
environmental pressure causes natural selection leading to a rise in the fitness
of the population. Once the genetic representation of a query and the
fitness function are defined, the algorithm proceeds to initialise a population of
queries randomly. The population of Wiki-queries is then improved through 
repetitive application of Selection, Crossover, Mutation and Replacement. 
To ensure sufficient diversity and reduce the risk of over-fitting the training 
set, the population is evolved in a number of co-evolving sub-populations. 
The Wiki-ES rules are then formed by collecting the fittest individuals from 
each sub-population to form the set of queries that participate in the voting 
function. 

The remainder of this section is structured as follows. First, Section 5.1
begins by providing the genetic representation of Wiki-queries as syntax trees.
Next, the initialization of the query populations is discussed in Section 5.2.
Fitness assignment and the production of new queries are covered in
Sections 5.3 and 5.4. Finally, the structure of the evolutionary algorithm is
presented in Section 5.5, which is followed by a short discussion on the
formation of Wiki-ES rules in Section 5.6.

5.1. Genetic Representation 

Each query is expressed as a syntax tree with the internal nodes acting as boolean
operators and the terminals as the concepts; see Table 1 for the
correspondence between the common GP components and the Wiki-queries. Figure 4
shows one such query which acts as an individual in the population. The
query shown in the figure is composed of four concepts, $\{w_1, w_2, w_3, w_4\}$,
and the basic boolean operators, {AND, OR, NOT}. The tree represents
the boolean expression $(w_1 \wedge w_2) \vee (w_3 \wedge (\neg w_4))$. Such a query will lead to
the selection of those documents from the library which either contain the
concepts $w_1$ and $w_2$, or contain the concept $w_3$ but not $w_4$. Each tree has
a depth which is representative of the size of the tree. The depth of a tree is
the number of branches traversed to reach the deepest terminal. The tree in
Figure 4 has $w_4$ as the deepest terminal and the depth of the tree is 3.
It should be noted that the depth of a root node is 0.
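
The syntax-tree representation and the depth measure can be made concrete with a short Python sketch. The `Node` class and the helper names are hypothetical; the tree built at the end is the query of Figure 4.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                                  # operator name or concept identifier
    children: List["Node"] = field(default_factory=list)


def depth(node):
    """Number of branches from this node down to its deepest terminal."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children)


def to_expression(node):
    """Render the tree as a parenthesized boolean expression."""
    if not node.children:
        return node.label
    if node.label == "NOT":
        return f"(NOT {to_expression(node.children[0])})"
    left, right = node.children
    return f"({to_expression(left)} {node.label} {to_expression(right)})"


tree = Node("OR", [
    Node("AND", [Node("w1"), Node("w2")]),
    Node("AND", [Node("w3"), Node("NOT", [Node("w4")])]),
])
tree_depth = depth(tree)            # the deepest terminal is w4
expression = to_expression(tree)
```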

5.2. Population Initialization 

Figure 4: Genetic representation. Boolean expression: (w1 AND w2) OR (w3 AND (NOT w4))

Table 1: The interpretation of GP-components in Wiki-query context

GP component                             Meaning in Wiki-query
---------------------------------------  ------------------------------------------
Terminals (leaf nodes)                   Wikipedia-concepts in a query-tree
Functions (non-leaf nodes)               Boolean query operators (AND, OR, NOT)
Fitness function                         The objective function (F-score) in the
                                         query learning problem
Reproduction, crossover, and mutation    Genetic operators for driving the
                                         development of Wiki-queries according to
                                         the evolutionary principles

As in any evolutionary algorithm, the individuals of the initial population are
generated randomly in genetic programming. The maximum depth ($d_{max}$)
that an individual can have is given as input. A number $d$ is chosen randomly
from the set $\{1, 2, 3, \ldots, d_{max}\}$. The chosen number becomes the depth of the
tree (individual) to be initialized. Starting from the root node, an operator
is chosen randomly from the set O = {AND, OR, NOT}, and placed at the
node. If the node turns out to be AND or OR, then two sub nodes are
created; otherwise a single sub node is created. The procedure is repeated
for each of the sub nodes and the tree size grows. At depth $d - 1$, a
terminal should be chosen to terminate the growth of the tree. Therefore, a
random choice is made from the set $W_0 = \{w_1, w_2, \ldots, w_k\}$ and the concept
is placed at the terminal. This completes the procedure to generate a single
individual. Following a similar procedure, a number of individuals equal to
the population size N are generated; the next step is to assign fitness to each
individual. Figure 5 shows the steps involved in initialising an individual of
depth 2.
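
The initialization procedure of this section can be sketched in Python as follows. Trees are encoded as nested tuples, which is a choice made for this sketch only. The sketch reads the stopping rule as follows: operators fill levels 0 to d - 1 and concepts are placed directly below them, so that every branch ends at level d and the tree has depth d in the sense of Section 5.1. This reading, the function names, and the example parameters are assumptions.

```python
import random

OPERATORS = {"AND": 2, "OR": 2, "NOT": 1}   # operator -> arity


def random_tree(d, concepts, rng):
    """Grow a random query tree whose terminals lie d branches below the root."""
    if d == 0:
        return rng.choice(concepts)            # terminal: a Wikipedia-concept
    op = rng.choice(sorted(OPERATORS))         # random operator for this node
    children = [random_tree(d - 1, concepts, rng) for _ in range(OPERATORS[op])]
    return (op, *children)


def init_population(size, d_max, concepts, rng):
    """Create `size` individuals, each with a depth drawn uniformly from 1..d_max."""
    return [random_tree(rng.randint(1, d_max), concepts, rng) for _ in range(size)]


def depth(tree):
    """Number of branches from the root to the deepest terminal."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree[1:])


rng = random.Random(0)                         # fixed seed for reproducibility
population = init_population(20, 4, ["w1", "w2", "w3", "w4"], rng)
```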

5.3. Fitness Assignment 

As already mentioned, the set $W_0 = \{w_1, w_2, \ldots, w_k\}$ is created by
scanning through the training set of documents and choosing the most relevant
concepts which give a good representation of the training set. Once a random
query is composed using members from the set $W_0$ and the basic boolean
operators, the query can be evaluated by verifying it against the training set.
The boolean query is applied to each document in the training set,
and the query predicts the document as relevant or irrelevant. The number
of correct relevant or irrelevant predictions determines the fitness of the query.
The algorithm searches for those queries which provide the maximum
number of correct predictions. Degeneracy often exists, as there is a possibility
of more than one query producing the same results and therefore having the
same fitness.

5.4. Producing New Queries

New queries, or offspring, are produced from the parent queries by means
of crossover and mutation. A crossover method is chosen such that two
parents result in two offspring. The crossover is performed by randomly
choosing a crossover point in each parent tree. Once the crossover points
are chosen, the offspring are created by swapping the subtree rooted at the
crossover point of one parent with the subtree rooted at the crossover point
of the other parent. Figure 6 shows two parents and the crossover operation.
The subtrees to be swapped are shown shaded in the figure. Swapping the
two shaded subtrees produces the offspring.

Figure 6: Crossover (panels: Parent 1 and Parent 2)

Once the crossover operation is performed and the offspring are
produced, they undergo a mutation operation. A point mutation operation has
been used where each node is considered in turn, and with a particular
probability the primitive stored at the node is replaced with another randomly
chosen primitive of the same arity*. The mutation operation is shown
in Figure 7 for the second offspring produced from crossover. Making a choice
based on a mutation probability, the nodes with primitives $w_1$ and OR get
chosen; $w_1$ is replaced by a random member from the set $\{w_1, w_2, \ldots, w_{10}\}$
and OR is replaced by a random member from the set {OR, AND}. The
crossover and mutation operations together produce the final members which
compete with other members to enter the population based on their fitness.

* Arity means the number of arguments a function can take. In a query, a NOT gate
cannot be mutated into an OR or AND gate, as NOT takes a single argument as input
whereas AND and OR take two arguments as input.
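
Subtree crossover and arity-preserving point mutation can be sketched in Python as follows. Trees are encoded as nested tuples and node positions as tuples of child indices; both encodings and all function names are assumptions of this sketch. Choosing crossover points uniformly over all nodes is likewise an assumption, since the text only states that the points are chosen randomly.

```python
import random

ARITY = {"AND": 2, "OR": 2, "NOT": 1}


def paths(tree, prefix=()):
    """Yield the position of every node as a tuple of child indices."""
    yield prefix
    if not isinstance(tree, str):
        for i, child in enumerate(tree[1:]):
            yield from paths(child, prefix + (i,))


def get(tree, path):
    """Return the subtree rooted at `path`."""
    for i in path:
        tree = tree[i + 1]
    return tree


def put(tree, path, subtree):
    """Return a copy of `tree` with the node at `path` replaced by `subtree`."""
    if not path:
        return subtree
    i = path[0] + 1
    return tree[:i] + (put(tree[i], path[1:], subtree),) + tree[i + 1:]


def crossover(a, b, rng):
    """Swap randomly chosen subtrees of two parents, giving two offspring."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return put(a, pa, get(b, pb)), put(b, pb, get(a, pa))


def mutate(tree, concepts, p_m, rng):
    """Point mutation: replace a primitive by a random one of the same arity."""
    if isinstance(tree, str):
        return rng.choice(concepts) if rng.random() < p_m else tree
    op = tree[0]
    if rng.random() < p_m:
        op = rng.choice([o for o in sorted(ARITY) if ARITY[o] == ARITY[op]])
    return (op,) + tuple(mutate(c, concepts, p_m, rng) for c in tree[1:])


rng = random.Random(0)
parent1 = ("OR", ("AND", "w1", "w2"), ("NOT", "w3"))
parent2 = ("AND", "w4", ("OR", "w5", "w6"))
child1, child2 = crossover(parent1, parent2, rng)
mutant = mutate(child2, ["w1", "w2", "w3", "w4", "w5", "w6"], 0.1, rng)
```

Because NOT is the only unary operator in the primitive set, a NOT node can only be "replaced" by itself in this sketch, which matches the footnote on arity.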




Figure 7: Mutation 



5.5. Algorithm Description 

The proposed algorithm follows the framework of a general evolutionary 
algorithm. Instead of having a single population, the algorithm maintains 
multiple sub-populations which interact with each other during the
optimization run. The algorithm terminates when the prescribed number of
generations is completed. At the end of the optimization run, the algorithm
provides elites from each of the subpopulations as final solutions. These elites 
are expected to represent different niches in the search space. Each elite rep- 
resents a Wiki-query which participates in the formation of a Wiki-ES rule. 
Multiple queries are accepted as solutions from the algorithm, as we do not 
wish to rely on a single query. For any document, output of each query is 
taken into account through the voting function and the decision for relevance 
or irrelevance is made. A flowchart for the proposed genetic programming
algorithm is presented in Figure 8. In the following, we also discuss a
stepwise procedure for implementing the algorithm.

Figure 8: Flowchart for GP algorithm.

1. Initialize M different sub-populations randomly. Each sub-population
contains n individuals. It is noteworthy that the choice of M
determines the number of Wiki-queries participating in the Wiki-ES
rule, i.e. $M = |\mathcal{A}|$ in Definition 4.3.7.

2. Assign fitness to all the initialized individuals.

3. Initialise a generation counter Gen = 0.

4. If Gen is less than the maximum number of prescribed generations then go
to Step 5, otherwise go to Step 16.

5. Increment the generation counter by 1, Gen = Gen + 1.

6. Initialise a sub-population counter S = 0.

7. If S is less than the number of sub-populations M then go to Step 8,
otherwise go to Step 4.

8. Increment the sub-population counter by 1, S = S + 1.

9. Initialise an offspring counter Off = 0.

10. Choose two individuals randomly from sub-population S, perform a
tournament and choose the better individual as one of the members for
crossover.

11. Generate a random number between 0 and 1. If the value is less than
1/M, then choose two individuals randomly from a sub-population other
than S, otherwise choose two individuals randomly from
sub-population S. Perform a tournament and choose the winner as the other
member for crossover.

12. Perform crossover with a crossover probability $p_c$. This produces two
offspring.

13. Mutate the offspring with a mutation probability $p_m$.

14. Increment the offspring counter by 2, Off = Off + 2.

15. If the offspring count Off is equal to n, then combine the offspring and
the individuals from sub-population S into a pool. Choose the n
best members from the pool, copy them into sub-population S and go
to Step 7. If the offspring count Off is less than n, then go to Step 10.

16. Choose the best members from each sub-population as final solutions.
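
The stepwise procedure above can be condensed into a compact Python sketch. It is a simplified illustration rather than the original implementation: documents are plain sets of concepts matched by exact presence (the concept-evaluator of Section 4.3 is omitted), fitness is the F-score, parents are copied unchanged when crossover is skipped, and all default parameter values are hypothetical.

```python
import random

ARITY = {"AND": 2, "OR": 2, "NOT": 1}


def grow(d, concepts, rng):
    if d == 0:
        return rng.choice(concepts)
    op = rng.choice(sorted(ARITY))
    return (op,) + tuple(grow(d - 1, concepts, rng) for _ in range(ARITY[op]))


def matches(tree, doc):
    """Boolean evaluation with exact concept presence (relatedness omitted)."""
    if isinstance(tree, str):
        return tree in doc
    if tree[0] == "NOT":
        return not matches(tree[1], doc)
    left, right = matches(tree[1], doc), matches(tree[2], doc)
    return (left and right) if tree[0] == "AND" else (left or right)


def fitness(tree, docs, labels):
    """F-score of the query on the training documents."""
    pred = [matches(tree, d) for d in docs]
    hits = sum(1 for p, r in zip(pred, labels) if p and r)
    if hits == 0:
        return 0.0
    precision, recall = hits / sum(pred), hits / sum(labels)
    return 2 * precision * recall / (precision + recall)


def paths(tree, prefix=()):
    yield prefix
    if not isinstance(tree, str):
        for i, child in enumerate(tree[1:]):
            yield from paths(child, prefix + (i,))


def get(tree, path):
    for i in path:
        tree = tree[i + 1]
    return tree


def put(tree, path, sub):
    if not path:
        return sub
    i = path[0] + 1
    return tree[:i] + (put(tree[i], path[1:], sub),) + tree[i + 1:]


def crossover(a, b, rng):
    pa, pb = rng.choice(list(paths(a))), rng.choice(list(paths(b)))
    return put(a, pa, get(b, pb)), put(b, pb, get(a, pa))


def mutate(tree, concepts, p_m, rng):
    if isinstance(tree, str):
        return rng.choice(concepts) if rng.random() < p_m else tree
    op = tree[0]
    if rng.random() < p_m:
        op = rng.choice([o for o in sorted(ARITY) if ARITY[o] == ARITY[op]])
    return (op,) + tuple(mutate(c, concepts, p_m, rng) for c in tree[1:])


def evolve(docs, labels, concepts, M=3, n=10, generations=20,
           p_c=0.9, p_m=0.1, d_max=3, seed=0):
    rng = random.Random(seed)

    def fit(tree):
        return fitness(tree, docs, labels)

    pops = [[grow(rng.randint(1, d_max), concepts, rng) for _ in range(n)]
            for _ in range(M)]                                   # Steps 1-2
    for _ in range(generations):                                 # Steps 3-5
        for s in range(M):                                       # Steps 6-8
            offspring = []                                       # Step 9
            while len(offspring) < n:                            # Steps 14-15
                first = max(rng.sample(pops[s], 2), key=fit)     # Step 10
                src = s
                if M > 1 and rng.random() < 1 / M:               # Step 11
                    src = rng.choice([k for k in range(M) if k != s])
                second = max(rng.sample(pops[src], 2), key=fit)
                kids = (crossover(first, second, rng)            # Step 12
                        if rng.random() < p_c else (first, second))
                offspring += [mutate(k, concepts, p_m, rng) for k in kids]  # Step 13
            pool = pops[s] + offspring[:n]                       # Step 15: elitist pool
            pops[s] = sorted(pool, key=fit, reverse=True)[:n]
    return [max(pop, key=fit) for pop in pops]                   # Step 16
```

A call such as `evolve(docs, labels, concepts)` returns one elite query per sub-population. Because parents and offspring compete in a common pool, the best fitness within a sub-population never decreases from one generation to the next in this sketch.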

5.6. Formation of Wiki-ES rules 

As already mentioned, the suggested GP algorithm produces multiple
queries as its output. If the number of sub-populations in the algorithm is
chosen as M, then the number of final queries is also M. Given
a document, each query suggests it as either relevant or irrelevant. However,
we do not want to rely on a single query, but rather wish to take a weighted
contribution of each of the queries before making a final decision. Let the
queries be represented by $q_i$, $i \in \{1, 2, \ldots, M\}$, and the associated fitness
values by $F_i$, $i \in \{1, 2, \ldots, M\}$. For any given document $d$, if we
need to decide whether it is relevant or irrelevant, the output of each
query is considered. Let the output of each query for the document $d$ be
$b_i$, $i \in \{1, 2, \ldots, M\}$, where $b_i$ is either 0 or 1. Now a weighted contribution
of the queries is accounted for in the following metric $\mu$:

\[ \mu = \frac{\sum_{i=1}^{M} F_i b_i}{\sum_{i=1}^{M} F_i}. \]

If the value of the metric $\mu$ is greater than 0.5 then the document is
considered relevant, otherwise it is considered irrelevant. Using this weighted
contribution, the information from various niches is taken into account and
the risk of overfitting a query to the training document set is reduced.

6. Experiment and results 

To demonstrate the benefits of using Wiki-ES rules, we evaluate the
system by using the topics in the TREC-11 corpus. The experiment is structured as
follows. First, we begin with a description of the data set in Section 6.1, which
is followed in Section 6.2 by an account of the software components used to
implement the Wiki-ES system. The parameter setup of the GP algorithm is
outlined in Section 6.3. The results from the comparison of Wiki-ES against
competing algorithms are presented in Section 1^31 In particular, we illustrate 
the benefits of using Wikipedia-concepts for query learning by benchmarking 
the performance of Wiki-ES against a corresponding term-based model. 

6.1. Data 

The documents included in the TREC-11 corpus are Reuters RCV1 news
stories from the years 1996-1997. The data is partitioned into a training
set (items dated between 1996-08-20 and 1996-09-30) and a test set (the
remainder of the collection). The training and test sets are further
divided into 100 topic-specific subsets. All 100 TREC-11 topics (numbered
R101-R200) are used in the experiment. In this paper, only the initial
training data is used, while the relevance statements available for
adaptive learning are not utilized. Also, none of the information in the
separately available topic description file is used.

Given that query learning techniques tend to be highly dependent on the
quality and amount of training data, it is worthwhile to take a closer look
at the data available for the 100 TREC-11 topics. Figure 9 shows two
histograms displaying the number of training and evaluation documents for
each topic. To describe how the data sets are balanced between relevant and
irrelevant documents, the frequency bars are split to reflect their
proportions in both data sets. On average there are 12 relevant and 39
irrelevant document examples in the training data, and 90 relevant and 713
irrelevant in the evaluation set. However, the variation between topics is
quite drastic, especially in the evaluation set. As can be seen from the
histogram, the first 50 topics have a large evaluation set compared to the
remaining topics. It can also be seen that some topics are highly
imbalanced, in the sense that there is only a handful of relevant documents
for hundreds of irrelevant items; e.g. in the case of topic R137, less than
1% of the documents in the evaluation set are relevant. At the other
extreme, a few topics (e.g. R175) are very loosely defined, with the
majority of the documents being relevant. When considering the performance
of the Wiki-ES model, as well as the benchmarks, both the quantity and
balance of training data play important roles. In general, topics with a
relatively large proportion of relevant examples in the training data fare
better than the ones with very few relevant items. The topics with few
relevant documents provide good test cases for evaluating the efficacy of
the algorithm.
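
To make the imbalance concrete, the average counts quoted above can be turned into proportions. The short sketch below only restates that arithmetic; the all-irrelevant baseline is a hypothetical classifier added for illustration, and because it is computed from average counts it does not correspond to any single topic.

```python
def relevant_share(relevant: int, irrelevant: int) -> float:
    """Fraction of documents that are relevant."""
    return relevant / (relevant + irrelevant)


# Average counts per topic as reported in the text.
train_share = relevant_share(12, 39)   # about 0.235
eval_share = relevant_share(90, 713)   # about 0.112

# Hypothetical baseline: label every evaluation document as irrelevant.
all_irrelevant_accuracy = 1.0 - eval_share  # about 0.888
```

With roughly 11% relevant documents in an average evaluation set, a classifier can reach high accuracy while retrieving nothing, which is one reason to read accuracy figures alongside F-score, precision and recall.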
Figure 9: The number of relevant / irrelevant documents in TREC-11 topics.



6.2. System description 

The system used in the experiment was implemented in Java on top of the
GATE platform, which provides tools for standard document preprocessing
tasks. The other software components used in the implementation and
evaluation of the Wiki-ES framework are as follows:



• Wikipedia-model: The Wikipedia-based content model was built using
the WikipediaMiner published by Milne et al. [31], which was suitably
modified and integrated into our framework.

• NER: The named-entity recognition task was carried out using a
Conditional Random Field (CRF) classifier proposed by Finkel et al.

• Genetic programming: The co-evolutionary GP algorithm described
in Section 5 was implemented using the JGAP toolbox provided by
Meffert et al. [27].

• Classifiers: The classifiers (SVM and C4.5) used as benchmarks in the
experiment were implemented using Weka [15] through the Java-ML
package.



6.3. Parameter setting 

The GP procedure used in the paper has the usual genetic programming
parameters, such as population size, crossover probability and mutation
probability. The parameter settings used in this experiment are given in
Table 2.



Table 2: GP parameters

Parameter name                    Value
Number of generations, G          250
Number of sub-populations, M      10
Sub-population size, N            100
Crossover probability, p_c        0.9
Mutation probability, p_m         0.9
Initial tree depth                4
Maximum crossover depth           8



In addition to the general GP parameters, we have used 15 as the maxi- 
mum size for the terminal set while constructing query trees. That is, when 
building the queries, the maximum number of different Wikipedia-concepts 
that could appear in a single Wiki-query was limited to 15. The choice of 
Wikipedia-concepts for each topic was carried out by selecting the ones that 
appear most frequently in the relevant training documents. 
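
The selection of the terminal set can be sketched as below. This is only an illustration under stated assumptions: each relevant training document is represented as a list of already-detected Wikipedia-concept names, frequency is taken as the total number of occurrences (counting each concept once per document would be an equally plausible reading), and the function name `select_terminal_concepts` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def select_terminal_concepts(relevant_docs: Iterable[Iterable[str]],
                             max_concepts: int = 15) -> List[str]:
    """Return the most frequent concepts in the relevant training documents."""
    counts: Counter = Counter()
    for concepts in relevant_docs:
        # Assumption: every occurrence counts, not just one per document.
        counts.update(concepts)
    return [concept for concept, _ in counts.most_common(max_concepts)]
```

Capping the terminal set keeps the search space of query trees small, at the cost of ignoring concepts that are rare in the training data.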
6.4. Results

In this section, we present the results from two experiments carried out
using TREC-11 data. The first experiment, discussed in Section 6.4.1,
examines the importance of using Wikipedia-concepts in Wiki-ES rules by
comparing them against the results obtained by running the same algorithm
with a bag-of-words document model. By using the bag-of-words profile in
the competing model, we get an effective comparison against the established
IQBE-paradigm. The second experiment, presented in Section 6.4.2, evaluates
the benefits of the Wiki-ES model in comparison to the well-known
classification models based on Support Vector Machines (SVM) and the
decision-tree algorithm C4.5. As performance measures, we have used
F-score, precision and recall, which are defined as in Section 4.3.4.
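
For reference, the performance measures can be computed per topic as in the sketch below. It assumes the standard definitions and the balanced F-score (F1); the definitions given earlier in the paper are not repeated in this section, so this is an illustration rather than the exact evaluation code. The function name `evaluate_topic` is hypothetical.

```python
from typing import Dict, Sequence


def evaluate_topic(predicted: Sequence[int], actual: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy for one topic (1 = relevant)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    # Ratios with an empty denominator are reported as 0.0 by convention here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Averaging these per-topic values over all topics gives summary figures of the kind reported in the result tables.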

6.4.1. Experiment 1: Effect of Wikipedia semantics

Given that the main contribution of the Wiki-ES framework is the inte- 
gration of Wikipedia's knowledge into the query learning problem, the first 
question to ask is how much the retrieval results have been improved by 
the infusion of the semantic information. In order to quantify the effect, we 
consider an experiment where the co-evolutionary GP-algorithm is run with 
two alternative content models: the Wikipedia-based model and the bag-of- 
words model. This allows us to eliminate the effect of the algorithm and 
focus on the improvement following from the concept-based representation 
of documents and queries. 

The key performance measures are summarized in Table 3, where Token-GP
refers to the model using the bag-of-words representation. The results are
computed as averages across all 100 topics. A direct comparison shows that
Wiki-ES yields an improvement of 62% in F-score over the Token-GP model.
Interestingly, when comparing the results with respect to precision and
recall, we find that most of the difference in F-score is due to the better
recall of Wiki-ES, while precision and accuracy are roughly the same. Given
the way the concept-relatedness measure is utilised in the evaluation of
Wiki-queries, this outcome was anticipated: Wiki-queries can also match
documents that contain a closely related concept, which a word-based search
would have ignored. Token-GP rules, in contrast, require that the words in
the query expressions are directly detected, which is likely to weaken
their ability to match relevant documents.
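
The difference between the two matching schemes can be illustrated with a toy example. Everything in the sketch is an assumption made for illustration: the relatedness scores, the 0.5 threshold and the concept names are invented, and the actual evaluation of Wiki-queries described earlier in the paper may work differently.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical relatedness scores between pairs of Wikipedia-concepts.
RELATEDNESS: Dict[FrozenSet[str], float] = {
    frozenset({"Mining accident", "Mine disaster"}): 0.8,
    frozenset({"Mining accident", "Stock market"}): 0.1,
}


def relatedness(a: str, b: str) -> float:
    """Look up a symmetric relatedness score; identical concepts score 1."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset((a, b)), 0.0)


def concept_matches(query_concept: str, doc_concepts: Set[str],
                    threshold: float = 0.5) -> bool:
    """Match if any document concept is sufficiently related to the query."""
    return any(relatedness(query_concept, c) >= threshold for c in doc_concepts)


def token_matches(query_token: str, doc_tokens: Set[str]) -> bool:
    """Bag-of-words matching: the exact token must be present."""
    return query_token in doc_tokens
```

In this toy setting, a document annotated with "Mine disaster" satisfies a query on "Mining accident", whereas a token query for "accident" fails on a document whose words are only "mine" and "disaster". This is the kind of case in which concept-based matching would be expected to raise recall.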
Table 3: Results for Wiki-ES and Token-GP Algorithms

Algorithm     F-Score    Precision    Recall    Accuracy
Wiki-ES       0.4218     0.4104       0.5200    0.8436
Token-GP      0.2596     0.4002       0.2925    0.8466



To provide a better idea of the differences in F-score between the two
algorithms across the individual topics, Figures 10 and 11 show the
difference for Wiki-ES minus Token-GP. Positive bars in the figures
indicate the topics where the use of Wikipedia's semantics has been
beneficial in terms of F-score, recall and precision. The reason for
splitting the evaluation into subfigures stems from the characteristics of
the topics. The first half of the dataset (R101-R150; Figure 10) represents
topics where the individual query expressions participating in the Wiki-ES
rules tend to have more complicated structures. In particular, they
commonly feature conditions that would require the use of a NOT-gate to
construct the query expressions. For example, in topic R120, we are looking
for documents on deaths of mine workers where the death has occurred due to
a mining accident and is not related to an ethnic clash between miners.
Overall, we find that the various constraints involved in the first 50
topics make them tougher for both models. However, when comparing the
performance differences, it appears that it is exactly on these difficult
topics that the Wikipedia-based approach has the largest edge over
Token-GP. The average percentage improvement in F-score in favor of Wiki-ES
is 91.37% for topics R101-R125 and 82.51% for topics R126-R150, both of
which are considerably larger than the improvement across all of the
topics.
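
A query of the kind needed for topic R120 can be sketched as a small Boolean expression tree. This is a minimal illustration with exact concept matching only; the concept names, the tuple-based tree encoding and the function name `evaluate` are hypothetical and do not reproduce a query learnt by the system.

```python
from typing import Set, Tuple, Union

# A query is either a concept name or a tuple (operator, operand, ...).
Query = Union[str, Tuple]


def evaluate(query: Query, doc_concepts: Set[str]) -> bool:
    """Evaluate a Boolean query tree against the concepts of one document."""
    if isinstance(query, str):
        return query in doc_concepts
    op, *operands = query
    if op == "NOT":
        (operand,) = operands  # NOT takes exactly one operand
        return not evaluate(operand, doc_concepts)
    if op == "AND":
        return all(evaluate(o, doc_concepts) for o in operands)
    if op == "OR":
        return any(evaluate(o, doc_concepts) for o in operands)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical query: mining deaths not related to an ethnic clash.
R120_QUERY = ("AND", "Mining accident", "Death", ("NOT", "Ethnic conflict"))
```

A document annotated with "Mining accident" and "Death" satisfies this query, but the same document is rejected once "Ethnic conflict" is also present, which is the exclusion that a purely conjunctive query cannot express.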

The results reported for the remaining topics (R151-R200) also show that
the use of Wikipedia-concepts has improved the F-scores substantially; see
Figure 11. However, the average percentage difference in F-score is 54.57%
for topics R151-R175 and 38.57% for R176-R200. This suggests that although
both models achieve higher F-scores than on the first 50 topics, the
difference in their performance has narrowed on these simpler topics. When
examining Figure 11, we observe that even Token-GP has often performed
well in terms of precision for these last 50 topics. However, Wiki-ES still
clearly outperforms it in terms of recall.

Figure 10: Differences in F-score (F), Precision (P), and Recall (R) between
models Wiki-ES and Token-GP for topics R101-R150.

To summarize, the experiment lends support to two conclusions. First,
the use of Wikipedia's concept information appears to have a substantial
effect on the performance of the Wiki-ES framework. The improvement
stems from the ability of the rules to achieve higher recall without losing
too much precision. Second, based on the current results, the strengths of
Wiki-ES are most pronounced when the query learning problem includes
strict constraints that the system should be able to figure out. This,
combined with narrow topic definitions and a meager supply of relevant
documents, characterizes the use-cases where the Wiki-ES rules achieve
considerably better overall performance than token-based rules.

6.4.2. Experiment 2: Comparison with classification models

The purpose of the second experiment is to compare the performance of
the Wiki-ES model against two well-known classification algorithms, SVM and
C4.5. In order to also evaluate the effect of feature selection, the
benchmark algorithms are trained using both token-based (bag-of-words)
and wiki-based document representations. The support vector algorithms
are referred to as Token-SVM and Wiki-SVM, and the decision-tree
algorithms are denoted by Token-C4.5 and Wiki-C4.5, respectively.

Figure 11: Difference in F-score (F), Precision (P), and Recall (R) between
models Wiki-ES and Token-GP for topics R151-R200.

The results are summarized in Table 4, where key performance measures
are reported for each of the 5 models. A general comparison of the models
suggests that the Wiki-ES framework consistently outperforms its
benchmarks in terms of F-score. Once again, the primary cause of the
performance advantage appears to be the improved recall of Wiki-ES rules.
The SVM-based models, in contrast, yield better results if only precision
is considered. However, the recalls of Token-SVM and Wiki-SVM are quite
poor, which leads to a modest overall performance. The differences in
accuracy are relatively small across all of the models.



Table 4: Results for Wiki-ES, Token-C4.5, Token-SVM, Wiki-C4.5 and Wiki-SVM

Algorithm      F-Score    Precision    Recall    Accuracy
Wiki-ES        0.4218     0.4104       0.5200    0.8436
Token-C4.5     0.2849     0.2770       0.3730    0.8048
Token-SVM      0.2215     0.5755       0.2098    0.8863
Wiki-C4.5      0.3150     0.3478       0.3678    0.8386
Wiki-SVM       0.2530     0.5649       0.2290    0.8868
Table 5: Performance matrix showing the performance of each algorithm when
compared with the other algorithms. The comparison is computed as the
relative difference in F-scores,
$100 \times (F_{\mathrm{algo1}} - F_{\mathrm{algo2}}) / F_{\mathrm{algo2}}$,
where $F_{\mathrm{algo1}}$ is the average F-score of the algorithm in the
column and $F_{\mathrm{algo2}}$ is the average F-score of the algorithm in
the row.

Algorithm     Wiki-ES   Token-GP   Token-C4.5   Token-SVM   Wiki-C4.5   Wiki-SVM
Wiki-ES         0.00%
Token-GP       62.48%      0.00%
Token-C4.5     48.07%     -8.87%        0.00%
Token-SVM      90.43%     17.21%       28.61%       0.00%
Wiki-C4.5      33.91%    -17.58%       -9.56%     -29.68%       0.00%    -19.67%
Wiki-SVM       66.69%      2.60%       12.58%     -12.46%      24.48%      0.00%



Finally, to consider the effect of training data on the benchmark
algorithms, we have computed relative differences in F-scores between each
pair of models. The results are presented in Table 5. For the sake of
completeness, Token-GP is also included in the comparison. A quick overview
suggests the following observations. First of all, we find that the use of
Wikipedia-concepts in document models had a positive effect on the results
for all the algorithms. However, there is a substantial difference in the
size of the effects. The effect of Wikipedia-concepts is large between
Wiki-ES and Token-GP, but the corresponding comparisons for the pairs
Token-SVM vs Wiki-SVM and Token-C4.5 vs Wiki-C4.5 show only modest
improvements. This is best explained by the fact that SVM and C4.5 based
algorithms are not able to use concept-relatedness information while
classifying documents as relevant or irrelevant. As discussed in Section
4.3, it is the evaluation stage of Wiki-ES rules that makes them
substantially different from classical approaches. Therefore, we can
conclude that it is not only the Wikipedia-based document profile which
makes Wiki-ES a powerful technique, but also the way the Wiki-ES rules
utilise Wikipedia's concept-relatedness information while matching
documents.
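
The entries of the performance matrix follow directly from the average F-scores. The sketch below recomputes them from the rounded averages listed in Tables 3 and 4; the dictionary `F_SCORES` and the function name `relative_difference` are illustrative. Because the inputs are rounded to four decimals, some recomputed entries may differ from the tabulated ones in the second decimal place.

```python
# Average F-scores as reported in Tables 3 and 4.
F_SCORES = {
    "Wiki-ES": 0.4218,
    "Token-GP": 0.2596,
    "Token-C4.5": 0.2849,
    "Token-SVM": 0.2215,
    "Wiki-C4.5": 0.3150,
    "Wiki-SVM": 0.2530,
}


def relative_difference(column_alg: str, row_alg: str) -> float:
    """Relative F-score difference in percent: 100 * (F_col - F_row) / F_row."""
    f_col = F_SCORES[column_alg]
    f_row = F_SCORES[row_alg]
    return 100.0 * (f_col - f_row) / f_row


# Wiki-ES (column) compared with Token-GP (row).
wiki_vs_token_gp = round(relative_difference("Wiki-ES", "Token-GP"), 2)
```

The same function applied to the pairs Wiki-SVM vs Token-SVM and Wiki-C4.5 vs Token-C4.5 gives the much smaller gains discussed above.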

7. Conclusions 

The purpose of any automated query learning system is to help the user
define a query that finds the items relevant to her topic. A plethora of
studies exist in this direction, which have been discussed in the paper. We
have also discussed that the conventional frameworks lack the concept-level
information contained in a word or token. Although some studies have made
use of intelligent systems, it has been difficult for them to significantly
improve the performance of the information retrieval frameworks. This
suggests that all these information retrieval systems inherently lack an
important feature: they do not utilize concept-based information, which
prevents improvement beyond a certain point.

The proposition of accessing concept-level information through Wikipedia,
made in this paper, provides a simple and fast technique to bring human-
and society-level information into an information retrieval system.
Wikipedia is a free and universally available database of information,
which is frequently updated by the Wikipedia community. This saves the
cost and time required to maintain any such encyclopedia, which justifies
our choice of using Wikipedia. The implementation of Wikipedia semantics in
constructing a query has produced a significant improvement in results. To
support the generality of the suggestion for existing information
retrieval systems, the idea has been implemented on two other existing
frameworks, leading to improved performance. Hybridizing the
concept-based-query idea with an intelligent system, genetic programming in
this case, is able to produce results substantially better than what has
been reported earlier. From the results it has also been observed that the
Wiki-ES framework is able to perform much better than its counterparts on
the difficult topics in particular. The results obtained in the paper are
promising, and the proposition made is generic, which should encourage
future research in this direction. Emphasis is needed on equipping a query
with concept-based knowledge, which should help eliminate the barriers
faced by contemporary information retrieval frameworks.